To address the problem that feature selection on the large-scale data sets typical of "big data" is too slow for practical use, a fast feature selection algorithm for unsupervised massive data sets was proposed, based on the incremental absolute reduction algorithm of traditional rough set theory. Firstly, the large-scale data set was treated as a random object sequence and the candidate reduct was initialized to the empty set. Secondly, objects were drawn at random, one at a time, from the large-scale data set without replacement. Each drawn object was checked to see whether the candidate reduct could distinguish it from the objects in the current object set, and was then merged into the current object set; if the candidate reduct could not distinguish the new object, a new attribute that could distinguish it was added to the candidate reduct. Finally, once I successive objects were all distinguishable using the candidate reduct, the candidate reduct was taken as the reduct of the large-scale data set. Experiments on five unsupervised large-scale data sets demonstrated that a reduct distinguishing no less than 95% of object pairs could be found within 1% of the time needed by the discernibility-matrix-based algorithm and the incremental absolute reduction algorithm. In the text topic mining experiment, the topics found from the reduced data set were consistent with those found from the original data set. The experimental results show that the proposed algorithm can obtain effective reducts for large-scale data sets in practical time.
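The sampling loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fast_reduct`, the parameter `patience` (standing in for I), the greedy choice of which attribute to add, and the decision to skip exact duplicate objects are all assumptions made for illustration.

```python
import random

def fast_reduct(objects, patience, seed=0):
    """Sketch of incremental reduct search over a randomly ordered object sequence.

    objects  : list of equal-length tuples of attribute values
    patience : hypothetical name for I, the number of successive
               distinguishable objects required before stopping
    """
    rng = random.Random(seed)
    order = list(range(len(objects)))
    rng.shuffle(order)            # random draws without replacement
    reduct = []                   # candidate reduct (attribute indices), initially empty
    seen = []                     # current object set, distinct objects only
    seen_full = set()
    streak = 0
    for idx in order:
        obj = tuple(objects[idx])
        if obj in seen_full:
            continue              # assumption: exact duplicates are skipped
        # Seen objects that the candidate reduct cannot tell apart from obj.
        clashes = [s for s in seen if all(s[a] == obj[a] for a in reduct)]
        if clashes:
            streak = 0
            while clashes:
                # Assumption: greedily add the attribute separating the most clashes.
                candidates = [a for a in range(len(obj)) if a not in reduct]
                best = max(candidates,
                           key=lambda a: sum(s[a] != obj[a] for s in clashes))
                reduct.append(best)
                clashes = [s for s in clashes if s[best] == obj[best]]
        else:
            streak += 1
        seen.append(obj)
        seen_full.add(obj)
        if streak >= patience:
            break                 # I successive objects were distinguishable
    return sorted(reduct)
```

Under these assumptions, a small `patience` stops early and may return a reduct that leaves some object pairs indistinguishable, which matches the abstract's framing of a reduct that distinguishes most, rather than all, object pairs.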
To address the problem that the label kernel functions used in multi-label feature extraction methods do not take the correlation between labels into consideration, two methods for constructing new label kernel functions were proposed. In the first method, the multi-label data were transformed into single-label data, so that the correlation between labels could be characterized by the label set; a new label kernel function was then defined from the perspective of the loss function of single-label data. In the second method, mutual information was used to characterize the correlation between labels, and a new label kernel function was proposed from that perspective. Experiments on three real-life data sets using two multi-label classifiers demonstrated that the feature extraction method with the loss-function-based label kernel performed best on all measures, improving the five evaluation measures by 10% on average; notably, on the Yeast data set, the Coverage measure (for which lower is better) declined by about 30%. The feature extraction method with the mutual-information-based label kernel followed closely, improving the five evaluation measures by 5% on average. The theoretical analysis and simulation results show that the feature extraction methods based on the new output kernel functions can effectively extract features, simplify the learning process of multi-label classifiers and, moreover, improve the performance of multi-label classification.
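The abstract does not give the kernel formulas, so the following is only a rough sketch of what the two constructions might look like. The function names, the use of 1 minus the 0/1 loss between label sets for the first kernel, and the quadratic form y_i^T M y_j over a pairwise mutual information matrix M for the second are all assumptions; in particular, the second form is not guaranteed to be positive semidefinite.

```python
from math import log

def powerset_kernel(Y):
    """Assumed loss-based kernel: each distinct label set is one class,
    and K[i][j] = 1 - zero_one_loss(class_i, class_j)."""
    keys = [tuple(row) for row in Y]
    return [[1.0 if a == b else 0.0 for b in keys] for a in keys]

def label_mutual_information(Y):
    """Pairwise mutual information (in nats) between binary label columns."""
    n, q = len(Y), len(Y[0])
    M = [[0.0] * q for _ in range(q)]
    for a in range(q):
        for b in range(q):
            mi = 0.0
            for u in (0, 1):
                for v in (0, 1):
                    p_uv = sum(1 for r in Y if r[a] == u and r[b] == v) / n
                    p_u = sum(1 for r in Y if r[a] == u) / n
                    p_v = sum(1 for r in Y if r[b] == v) / n
                    if p_uv > 0:
                        mi += p_uv * log(p_uv / (p_u * p_v))
            M[a][b] = mi
    return M

def mi_kernel(Y):
    """Assumed mutual-information kernel: K[i][j] = y_i^T M y_j."""
    M = label_mutual_information(Y)
    q = len(M)
    return [[sum(yi[a] * M[a][b] * yj[b] for a in range(q) for b in range(q))
             for yj in Y] for yi in Y]

# Rows are instances, columns are binary labels (toy data).
Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
K_loss = powerset_kernel(Y)
K_mi = mi_kernel(Y)
```

Both kernels depend on which labels occur together rather than treating each label independently, which is the property the abstract says the earlier label kernels lacked.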
Access requests in computer networks arrive in real time and change dynamically. To detect network intrusions in real time and adapt to the dynamic change of network access data, a real-time network intrusion detection framework based on data streams was proposed. Firstly, a misuse detection model and an anomaly detection model were combined, and a knowledge base made up of normal patterns and abnormal patterns was established by initial clustering. Secondly, the similarity between network access data and the normal and abnormal patterns was measured using the dissimilarity between a data point and a data cluster, and the legitimacy of the network access data was determined accordingly. Finally, when the network access data stream evolved, the knowledge base was updated by reclustering to reflect the current state of network access. Experiments on the intrusion detection data set KDDCup99 show that, with 10000 initial clustering samples, 10000 clustering samples in the buffer, and an adjustment coefficient of 0.9, the proposed framework achieves a recall rate of 91.92% and a false positive rate of 0.58%. This approaches the result of the traditional non-real-time detection model, yet the whole learning and detection process scans the network access data only once. With the introduction of the knowledge base update mechanism, the proposed framework has advantages in both the real-time performance and the adaptability of intrusion detection.
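The three stages of the framework (initial clustering, dissimilarity-based decision, knowledge base update from a buffer) can be sketched in a single pass over the stream as follows. This is a hedged illustration only: the class and function names, the leader-style clustering, the Euclidean dissimilarity, the fixed `radius` threshold, and the rule that points far from every known pattern are flagged abnormal are assumptions, and the abstract's adjustment coefficient is not modelled.

```python
from math import dist

class Cluster:
    """One pattern in the knowledge base: a centroid plus a normal/abnormal label."""
    def __init__(self, point, label):
        self.centroid = list(point)
        self.count = 1
        self.label = label

    def add(self, point):
        # Running mean keeps the centroid without storing past points.
        self.count += 1
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, point)]

def leader_cluster(samples, radius, clusters=None):
    """Single-pass clustering of (point, label) pairs (assumed algorithm)."""
    clusters = [] if clusters is None else clusters
    for point, label in samples:
        same = [c for c in clusters if c.label == label]
        nearest = min(same, key=lambda c: dist(c.centroid, point), default=None)
        if nearest is not None and dist(nearest.centroid, point) <= radius:
            nearest.add(point)
        else:
            clusters.append(Cluster(point, label))
    return clusters

class StreamDetector:
    def __init__(self, labelled_samples, radius, buffer_size):
        self.radius = radius
        self.buffer_size = buffer_size
        # Initial clustering builds the knowledge base of patterns.
        self.knowledge_base = leader_cluster(labelled_samples, radius)
        self.buffer = []

    def classify(self, point):
        nearest = min(self.knowledge_base, key=lambda c: dist(c.centroid, point))
        # Close to a known pattern: take its label; otherwise treat as an anomaly.
        near = dist(nearest.centroid, point) <= self.radius
        label = nearest.label if near else "abnormal"
        self.buffer.append((point, label))
        if len(self.buffer) >= self.buffer_size:
            # Fold buffered points into the knowledge base, then clear the buffer.
            self.knowledge_base = leader_cluster(
                self.buffer, self.radius, self.knowledge_base)
            self.buffer = []
        return label
```

Each record is examined once when it arrives and once more when the buffer is folded into the knowledge base, so the raw stream itself never has to be rescanned.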